The SALAH Project: Segmentation and Linguistic Analysis of ḥadīṯ Arabic Texts
Abstract
A model for the unsupervised segmentation and linguistic analysis of Arabic texts of Prophetic tradition (ḥadīṯs), SALAH, is proposed. The model automatically segments each text unit into a transmitter chain (isnād) and a text content (matn), and then analyzes each segment with its own pipeline: a set of regular expressions chunks the transmitter chain into a graph labeled with the relations between transmitters, while a tailored, augmented version of the AraMorph morphological analyzer (RAM) lexically and morphologically analyzes and annotates the text content. The final output of the system is a graph of relations among transmitters and a lemmatized text corpus, both in XML format, which can in turn feed the automatic generation of concordances of the texts with variable-sized windows. The results can be useful for a variety of purposes, including retrieving information from ḥadīṯ texts, verifying the relations between transmitters, finding variant readings, and supplying lexical information to specialized dictionaries.
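The two ideas in the abstract that lend themselves to a short illustration are the regular-expression chunking of a transmitter chain into labeled relations and the concordance with a variable-sized window. The following is a minimal sketch under stated assumptions, not the SALAH implementation: the three transliterated transmission terms, the function names `chain_edges` and `concordance`, and the toy inputs are all hypothetical, and the real system works on Arabic script with a much richer rule set and emits XML.

```python
import re

# Hypothetical, simplified set of transliterated transmission terms.
# The actual SALAH rule set is not given in the abstract.
TERMS = ("ḥaddaṯanā", "aḫbaranā", "ʿan")

# Match a term only when it stands alone between whitespace or string edges.
TERM_RE = re.compile(r"(?<!\S)(" + "|".join(map(re.escape, TERMS)) + r")(?!\S)")


def chain_edges(isnad):
    """Return (transmitter, relation, source) triples for consecutive names."""
    # With one capturing group, re.split alternates: text, term, text, term, ...
    parts = [p.strip() for p in TERM_RE.split(isnad)]
    links = [(parts[i], parts[i + 1]) for i in range(1, len(parts) - 1, 2)]
    links = [(term, name) for term, name in links if name]
    # Each edge is labeled with the term that introduces the later name.
    return [
        (links[i][1], links[i + 1][0], links[i + 1][1])
        for i in range(len(links) - 1)
    ]


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Toy transliterated inputs, for illustration only.
    for edge in chain_edges("ḥaddaṯanā Mālik ʿan Nāfiʿ ʿan Ibn ʿUmar"):
        print(edge)
    for hit in concordance(["innamā", "al-aʿmāl", "bi-n-niyyāt"], "al-aʿmāl", window=1):
        print(hit)
```

The `window` parameter sets how many tokens of context are kept on each side of the keyword, which is one simple reading of "variable-sized windows"; a lemmatized corpus would let the same function match on lemmas instead of surface forms.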
Similar resources
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. The system includes preprocessing, document segmentation, feature extraction and document classification. In preprocessing, a document image is enhanced by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
Realization of Minimum Discursive Units Segmentation of Arab Oral Utterances
Unlike the written texts, discourse segmentation of the Arab oral dialogues is a challenging task that is held back in most cases by the spontaneous character of oral speech. Like any segmentation task, segmentation in minimum discursive units (UDM) aims to cut the different statements of a speech into simple proposals easily usable in subsequent treatment. The majority of the work on the Arabi...
Persian Printed Document Analysis and Page Segmentation
This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In the low-resolution page segmentation, a pyramidal image structure is constructed for multiscale analysis and the document image is segmented into a set of regions. In the high-resolution page segmentation, each region is segmented by connected components analysis into homogeneous regions and identifyi...
Italian Arabic linguistic tools
This paper concerns our participation in the research project: ‘Corpus bilingue Italiano Bilingual Italian – Arabic corpus) funded by law 488/92. The purpose of this project is to develop some linguistic tools and resources for bilingual Italian/Arabic corpora; its background and starting point are tools that have already been developed by the Computational Linguistics Institute. As far as IT t...
Annotating events, Time and Place Expressions in Arabic Texts
We present in this paper an unsupervised approach to recognize events, time and place expressions in Arabic texts. Arabic is a resource-scarce language, and annotated corpora, lexicons and other needed NLP tools are not readily at hand. We show in this work that we can recognize events, time and place expressions in Arabic texts without using a POS-annotated corpus and without a lexicon. We ...